This guide describes common Longhorn PVC mount failure scenarios and the steps to resolve them.
Problem with multipathd service
In some cases, Longhorn fails to mount PersistentVolumeClaims (PVCs) to pods in a Kubernetes cluster. This issue is typically caused by a conflict with the multipathd service, which may mistakenly identify Longhorn volumes as in use and prevent the filesystem from being created.
The multipathd service is responsible for managing multiple paths to the same storage device. When it incorrectly identifies a Longhorn volume as being in use, it blocks the filesystem creation process, resulting in mount failures.
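Before changing the multipathd configuration, you can confirm that it has actually claimed the device backing the Longhorn volume. A hedged diagnostic sketch (run on the affected node; device names will differ in your environment):

```shell
# List block devices; a Longhorn volume is exposed to the node as an
# iSCSI-backed SCSI device (/dev/sdX) in addition to /dev/longhorn/pvc-...
lsblk

# List devices currently managed by multipathd; if the Longhorn-backed
# /dev/sdX appears here, multipathd has claimed it.
multipath -ll
```

If the Longhorn-backed device does not appear in the multipath output, the mount failure likely has a different cause.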
You might encounter the following error message in your Kubernetes environment:
Error Message:
Warning FailedMount 12s (x6 over 28s) kubelet
MountVolume.MountDevice failed for volume "pvc-87285c92-26c4-40bd-842d-7f608d9db2d8":
rpc error: code = Internal desc = format of disk "/dev/longhorn/pvc-87285c92-26c4-40bd-842d-7f608d9db2d8" failed:
type: ("ext4")
target: ("/var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/1e70ad7ff7c1222b1d656429fcc03679fdfa8ed3d9ae0739e656b2e161bfc08d/globalmount")
options: ("defaults")
errcode: (exit status 1)
output: (
mke2fs 1.46.4 (18-Aug-2021)
/dev/longhorn/pvc-87285c92-26c4-40bd-842d-7f608d9db2d8 is apparently in use by the system; will not make a filesystem here!
)
Solution
Follow these steps to resolve the issue:
Step 1: Edit the multipath.conf File
- Open the multipath.conf file for editing:
vi /etc/multipath.conf
- Add the following configuration to the multipath.conf file on all nodes in the cluster:
blacklist {
    devnode "^sd[a-z0-9]+"
}
- After adding the configuration, the file should look like this:
defaults {
    user_friendly_names yes
}
blacklist {
    devnode "^sd[a-z0-9]+"
}
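If you manage many nodes, the edit above can be scripted so it is safe to re-run. A minimal sketch that appends the blacklist stanza only if it is not already present (for safety it operates on a scratch file here; on a real node set CONF=/etc/multipath.conf and run as root):

```shell
# Sketch: add the Longhorn blacklist stanza to multipath.conf idempotently.
# CONF points at a scratch file for illustration; use /etc/multipath.conf
# on a real node.
CONF="${CONF:-$(mktemp)}"
printf 'defaults {\n    user_friendly_names yes\n}\n' > "$CONF"  # example existing contents

# Append the stanza only if it is not already there (grep -F = literal match).
if ! grep -qF 'devnode "^sd[a-z0-9]+"' "$CONF"; then
  cat >> "$CONF" <<'EOF'
blacklist {
    devnode "^sd[a-z0-9]+"
}
EOF
fi
```

Because the append is guarded by the grep check, running the script twice does not duplicate the stanza.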
Step 2: Restart the multipathd.service
After updating the multipath.conf file, restart the multipathd service on all nodes in the cluster:
systemctl restart multipathd.service
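After the restart, you can verify that the daemon is running and has picked up the new blacklist. A hedged sketch (run on each node):

```shell
# Confirm the service came back up after the restart.
systemctl is-active multipathd.service

# Confirm the running daemon sees the blacklist entry.
multipathd show config | grep -A 2 '^blacklist'
```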
Step 3: Delete and Recreate the Affected Pods
To apply the changes and resolve the issue, delete the affected pods so that Kubernetes can recreate them with the corrected configuration:
kubectl delete pod nextgen-gw-0 nextgen-gw-redis-master-0
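Assuming the pods are managed by a controller such as a StatefulSet (the -0 suffix in the example names suggests this), Kubernetes recreates them automatically after deletion. You can wait for them to become Ready again; a sketch using the pod names from the example above:

```shell
# Wait until the recreated pods report Ready (adjust names and timeout
# to your environment).
kubectl wait --for=condition=Ready pod/nextgen-gw-0 pod/nextgen-gw-redis-master-0 --timeout=300s
```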
Problem with Longhorn filesystem corruption
- Longhorn cannot remount a volume whose filesystem is corrupted, so the affected workload fails to restart.
- Longhorn cannot repair the filesystem automatically; you must fix it manually.
You might encounter the following error message in your Kubernetes environment:
Error Message:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 56s (x5809 over 8d) kubelet MountVolume.MountDevice failed for volume "pvc-b3ca140a-dab9-49f6-9f39-063594e58521" : rpc error: code = Internal desc = 'fsck' found errors on device /dev/longhorn/pvc-b3ca140a-dab9-49f6-9f39-063594e58521 but could not correct them: fsck from util-linux 2.39.3
/dev/longhorn/pvc-b3ca140a-dab9-49f6-9f39-063594e58521 contains a file system with errors, check forced.
/dev/longhorn/pvc-b3ca140a-dab9-49f6-9f39-063594e58521: Unattached inode 1555
Solution
Follow these steps to resolve the issue:
Step 1: Identify the Node Running the Pod
Run the following command to find the node where the gateway pod is running:
kubectl get pods -o wide
Sample Response:
root@opsramp-gateway:/home/gateway-admin# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nextgen-gw-0 0/3 ContainerCreating 0 12m 10.42.0.31 opsramp-gateway <none> <none>
nextgen-gw-redis-master-0 1/1 Running 0 25m 10.42.0.29 opsramp-gateway <none> <none>
From this output, we see that the gateway pod is running on the opsramp-gateway node.
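Alternatively, the node name can be read directly from the pod spec instead of scanning the wide output. A sketch using the pod name from the example above:

```shell
# Print only the node the pod is scheduled on.
kubectl get pod nextgen-gw-0 -o jsonpath='{.spec.nodeName}'
```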
Step 2: Log In to the Node and Fix the Filesystem Corruption
Log in to the node (opsramp-gateway) where the pod is running, then run the following command to repair the corrupted filesystem. Run fsck only while the filesystem is unmounted; in this scenario the volume is not mounted because the mount itself failed.
fsck -y <file-path>
Note
Obtain the <file-path> by describing the pod:
kubectl describe pod nextgen-gw-0
Sample Response:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 56s (x5809 over 8d) kubelet MountVolume.MountDevice failed for volume "pvc-b3ca140a-dab9-49f6-9f39-063594e58521" : rpc error: code = Internal desc = 'fsck' found errors on device /dev/longhorn/pvc-b3ca140a-dab9-49f6-9f39-063594e58521 but could not correct them: fsck from util-linux 2.39.3
/dev/longhorn/pvc-b3ca140a-dab9-49f6-9f39-063594e58521 contains a file system with errors, check forced.
/dev/longhorn/pvc-b3ca140a-dab9-49f6-9f39-063594e58521: Unattached inode 1555
In this case, the file path is /dev/longhorn/pvc-b3ca140a-dab9-49f6-9f39-063594e58521, so run:
fsck -y /dev/longhorn/pvc-b3ca140a-dab9-49f6-9f39-063594e58521
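Before recreating the pod, you can re-check the filesystem read-only to confirm the repair succeeded. A sketch using the device path from the example above:

```shell
# -n opens the filesystem read-only and answers "no" to all prompts,
# so this only reports whether any errors remain after the repair.
fsck -n /dev/longhorn/pvc-b3ca140a-dab9-49f6-9f39-063594e58521
```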
Step 3: Delete the Affected Pods
To apply the fixes, delete the affected pod so Kubernetes can recreate it:
kubectl delete pod nextgen-gw-0
If multiple pods are affected, repeat the deletion process for each.